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ABSTRACT 

An algorithm is developed to model the three-dimensional velocity distribu- 
tion function of a sample of stars using only measurements of each star's two- 
dimensional tangential velocity. The algorithm works with "missing data" : it re- 
constructs the three-dimensional distribution from data (velocity measurements) 
every one of which has one dimension unmeasured (the radial direction). It also 
accounts for covariant measurement uncertainties on the tangential velocity com- 
ponents. The algorithm is applied to tangential velocities measured in a kinemat- 
ically unbiased sample of 11,865 stars taken from the Hipparcos catalog, chosen 
to lie on the main sequence and have well- measured parallaxes. The local stellar 
velocity distribution function of each of a set of 20 color-selected subsamples is 
modeled as a mixture of two three-dimensional Gaussian ellipsoids of arbitrary 
relative responsibility. In the fitting, one Gaussian (the "halo") is fixed at the 
known mean velocity and velocity variance tensor of the Galaxy halo, and the 
other (the "disk" ) is allowed to take arbitrary mean and arbitrary variance tensor. 
The mean and variance tensor (commonly the "velocity ellipsoid") of the disk 
velocity distribution are both found to be strong functions of stellar color, with 
long-lived populations showing larger velocity dispersion, slower mean rotation 
velocity, and smaller vertex deviation than short-lived populations. The local 
standard of rest (LSR) is inferred in the usual way and the Sun's motion relative 
to the LSR is found to be (U, V, W) Q = (10.1,4.0,6.7) ± (0.5,0.8,0.2) kms^ 1 . 
Artificial data sets are made and analyzed, with the same error properties as 
the Hipparcos data, to demonstrate that the analysis is unbiased. The results 
are shown to be insensitive to the assumption that the velocity distributions are 
Gaussian. 
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Subject headings: Galaxy: fundamental parameters — Galaxy: kinematics and 
dynamics — methods: statistical — solar neighborhood — stars: kinematics 

1. Introduction 

The classical picture of the evolution of the velocity structure in the Galactic disk is that 
stars are born within low-dispersion clusters from cool gas on near-circular orbits. These 
clusters evaporate and the stellar orbit distribution is heated through gravitational pertur- 
bations to the smooth disk potential. Over time a stellar population's velocity dispersion 
grows and its mean motion lags behind that of pure circular orbits at the same Galacto- 
centric radius. Thus, the velocity distribution of stars in the Solar Neighborhood has been 
characterized as an ellipsoid whose centroid, size and orientation varies systematically with 
the lifetimes (and hence colors) of the stars under investigation (e.g., Dehnen & Binney 
1998). 

This field has undergone a recent renaissance with the release of the Hipparcos data set 
of proper motions and parallaxes, measured with accuracies of a few milliarcseconds. Studies 
using these data to analyze the local velocity distribution of stars can be broadly split into 
two categories: 

First, there are determinations of the moments of the velocity distribution as a function 
of color assuming (as above) that it can be described by a mean velocity and a single velocity 
dispersion tensor (Dehnen & Binney 1998; Bienayme 1999). These have led to more stringent 
limits on the Solar Motion relative to a (hypothetical) zero-dispersion population (the Local 
Standard of Rest), the age of the Galactic disk and rate of heating of stellar populations 
(Binney et al. 2000). 

Secondly, there are non-parametric derivations of the full three dimensional velocity 
distribution function (Dehnen 1998; Skuljan et al. 1999; Chereul et al. 1998). These have 
revealed that the velocity distribution is poorly described by a single ellipsoid; in fact it 
contains significant structures on smaller velocity scales. Importantly, the structures do not 
seem to be dominated by short-lived stars. These structures can be variously interpreted as 
trails from evaporating clusters (Chereul et al. 1999), or overdensities induced by resonances 
in the disk associated with the bar (Dehnen 2000; Fux 2001) and/or spiral arms (Quillen 
2003). 

Our long-term goal is to pursue the latter category of project; i.e., to develop algorithms 
to locate, understand the significance of, and characterize, nontrivial structures in the ve- 
locity distribution, not just in the local Galaxy but in the Galaxy halo. These projects will 
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require new space-based astrometry data (e.g., what we expect from the upcoming GAIA 
mission) in combination with large ground-based surveys (e.g., Sloan Digital Sky Survey, 
York et al. 2000; Two-Micron All-Sky Survey, Skrutskie et al. 1997; Grid Giant Star Survey, 
Majewski et al. 2000). In the short term, we have begun by analyzing the Hipparcos data 
set with a very general algorithm for fitting distribution functions to data measured with 
nontrivial error covariances and missing information. 

From a computer science or nonlinear statistics perspective, these problems fall into the 
category of "missing data" problems, in which one constructs a model of an object (here 
the velocity distribution function) using data points (here tangential velocities) every one 
of which is incomplete (because, in this case, it has no radial information). We present 
a framework for a large set of algorithms for solving such problems, and the details of the 
specific restriction to the velocity distribution function as measured with velocities projected 
onto the sphere. 

In this paper we further restrict our attention on the trial problem of re-deriving the 
properties of the velocity ellipsoid near the Sun. In what follows, it is assumed that any 
color-selected, kinematically unbiased sample of stars has a velocity distribution function 
which can be modeled (for the purposes of measuring its velocity variance) by a sum of two 
Gaussan ellipsoids, one for "halo stars" and one for "disk stars", and later, by a sum or 
mixture of K > 2 Gaussian ellipsoids. Model parameters are chosen to maximize the total 
likelihood of the Hipparcos measurements (which, in this case, are two-dimensional tangential 
velocity vectors), given their uncertainties (which are two-dimensional covariance tensors); 
i.e., the results presented here represent the optimization of an explicit, justified, scalar 
objective function. Our work differs from previous work in several respects: we have the 
scalar objective function, we present tests of the algorithm with relatively realistic artificial 
data, and we relax the Gaussian assumption (i.e., expand the space of allowed distribution 
functions). 

In later papers in this series, we intend to generalize our parameterization (to multi- 
modal disk distributions), locate velocity-space structures, measure their statistical signif- 
icance, and characterize their properties. This phenomenology will be essential for distin- 
guishing the various pictures for the origin of the velocity sub-structure in the Galaxy disk. 

2. Model and algorithm 

In what follows the standard Galactic velocity coordinate system is used, with directions 
x, y, and z (and associated unit vectors x, y, and z) pointing towards the Galactic center, 
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pointing in the direction of circular orbital motion, and pointing towards the north Galactic 
pole, respectively. Vectors will be implicitly defined to be column matrices, so a T b is the 
scalar product and a b T is a rank-2 tensor. The components x T v, y T v, and z T v of a velocity 
v are conventionally named U U" , "V" , and ' W . 

We treat any color-selected population of stars from Hipparcos as being composed of two 
kinematically distinct populations of stars, a "halo" population with velocity distribution 
described by a Gaussian ellipsoid in velocity space with a mean velocity v^aio with respect 
to the Sun and velocity dispersion (variance) tensor V ha i , with these parameters fixed at 

v halo = [-220 kms" 1 ]? 

V halo = [lOOkms-'HxxT + yyT + z^] (1) 

(Sirko et al. 2004), plus a "disk" population described by another Gaussian ellipsoid with 
mean v^isk and dispersion tensor V<ji s k, both of which are allowed to vary arbitrarily. The 
relative amplitude ahaio of the halo Gaussian (i.e., the fraction of stars in the halo) is also 
allowed to vary arbitrarily. Sensitivity of the results to the assumed halo velocity dispersion 
of 100 kms -1 is discussed below. 

The vast majority (~ 99 percent) of the sample is expected to be members of the disk 
population. However, the inclusion of a halo Gaussian prevents halo stars from distorting the 
measurement of the disk velocity variance. In effect, the halo Gaussian "clips out" velocity 
outliers in a responsible way. 

Almost all the difficulty in inferring the parameters of this model, i.e., v^sk (three 
parameters), Vdisk (six parameters), and the relative responsibility of the halo Gaussian 
(one parameter), comes from the fact that Hipparcos does not measure the total three-space 
velocity v of each star, but only its two-dimensional tangential projection. 



2.1. Model generalities 

The approach developed here is extremely general and can be applied to many different 
density estimation tasks in the presence of partially observed data. The assumption is that 
there are low-dimension observations Wj, which are noisy projections of higher-dimension 
"true values" v,: 

Wj = Rj Vj + noise , (2) 

where the Rj are known, non-square (or zero-determinant) projection matrices, and the noise 
is drawn from a Gaussian with zero mean and known (low-dimension) covariance tensor S«. 
It is also assumed that the Vj are drawn independently and identically distributed from a 
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probability distribution function p(v) in the higher-dimension space. The goal is to fit a 
model for p(v) using only the incomplete observations {wj}, their covariances {Sj} and the 
projection matrices {R«}. 

Note that there is no assumption that all data points have similar non-square projection 
matrices; in fact the projection matrices (and thus the observations) may have different 
dimensionalities. 

The density model p(v) is parameterized as a mixture of K Gaussians: 



K 

P(v) = 5>^( v K' V i) , (3) 

i=i 

where the amplitudes or "rates" aj sum to unity and the function jV(v|m, V) is the normal 
(Gaussian) distribution with mean m and variance tensor V. 

For a known projection matrix Rj and noise covariance S« in w space (the lower- 
dimensional space of the observations) each component of the mixture marginalizes to a 
lower-dimensional Gaussian, and so the induced density is a conditional mixture of Gaus- 
sians on w: 



p(w,v,j) = p(w\v)p(v\j)p(j) 

p(w|r,s) = / p( w l v M v li)p(i) dv 

J v 



p(w|v,R,S) = W(w|Rv,S) 

K 

p(wi\Ri,Si) = ^ aij M(wj | R^ uij , Tjj ) 

Tij = Rj Vj RJ + Sj , (4) 

where functions like p(x, y, z) are joint probability distribution functions of x, y, and z, and 
functions like p(x\y) are probability distribution functions of x given (or at a specific value 
of) y. All other symbols are described above, except T^-, which is the combined variance 
for each measurement % under the assumption that it is drawn from Gaussian j, with part 
of the variance coming from the (projected) variance Vj of the Gaussian, and part coming 
from the measurement uncertainty variance Sj. 

This model will be called the "projected mixture of Gaussians" model hereafter. 



The objective of the fitting procedure is to maximize the conditional likelihood of the 
entire set of low-dimensional projected observations given the nonsquare matrices and the 
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error covariances. In particular, we are fitting for the means {m,}, variance tensors {Vj} and 
amplitudes {ctj} of the mixture of Gaussians in the high-dimensional (unobserved) space. 
Assuming the noise on each observation is independent of other observation noises this (log) 
likelihood is 

K 

<f) = ^2lnp(w i \R i ,S i ) = ^2ln^2a j N(w i \R i m j ,T ij ) . (5) 

i i j=l 

The model parameters can be optimized in several ways. One approach is to directly 
compute gradients and use a generic optmizer to ascend the objective; this is complicated by 
restrictions on the parameters (e.g. the variances must be symmetric and positive-definite, 
the amplitudes must be non- negative and sum to unity). Another approach is to view the 
high-dimensional quantities as hidden variables and use the expectation-maximization (EM) 
algorithm (Dempster et al. 1977) to iteratively maximize the likelihood function. We take 
the latter approach; for details see the appendix. 



2.2. Hipparcos measurements and their uncertainties 

The sample used in this study is a kinematically unbiased sample of 11,865 nearby main- 
sequence stars (Dehnen & Binney 1998) from the Hipparcos catalog (ESA 1997), all chosen 
to have parallaxes measured at S/N = 7i/a w > 10. We made no corrections for Galactic 
rotation (since this study is simply of stellar velocities relative to the Sun); indeed we did 
no processing or correction of the Hipparcos data beyond making the sample cut. 

The "low dimension" data w, referred to above are the measured tangential components 
of each star's "high dimension" true three-dimensional velocity Vj, with 

Wj = Ri Vj 

Ri = [VlJ + UhJ] , (6) 

where the Rj are non-square or zero-determinant matrices, and lj and bj are the tangential 
unit vectors pointing in the Galactic latitude and longitude directions for each star. 



The Wj are constructed from the Hipparcos measurements (parallax and proper motion) 

as 

rv. T Ah . Ah- ^ 1 

(7) 



ro 

Wi = — 



, dk - dbi ~ 
cos bi — k + — hi 
at at 



where r is the radius of the Earth's orbit, 7ii is the star's parallax, and U and 6, are its Galac- 
tic latitude and longitude, and the cosfoj factor takes into account the spherical geometry. 
This calculation ignores the Lutz-Kelker bias (Lutz & Kelker 1973), but this is small for the 
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sample used here. Since the Hipparcos catalog reports proper motions in equatorial rather 
than Galactic coordinates, the above requires a rotation depending on each star's angular 
position on the sky and the epoch (1991.25) of the catalog positions. 

The Hipparcos catalog entries, which can be represented by some vector Cj for each star, 
come with single-star uncertainty covariance matrices Cj. If we can represent the derivative 
of the Wj with respect to the catalog entries Cj by a matrix Qj 

dwi = Qj dQ , (8) 

then the measurement uncertainty covariances S, for the w, are given by 

S, = Q l QQ l T . (9) 

This is accurate only in the limit of small parallax errors, which is fine for this sample. In 
addition, this whole procedure ignores star-to-star covariances, which could be significant, 
but which are not reported in the Hipparcos catalog. 



3. Results 

Following the general approach of Dehnen & Binney (1998), 20 color-selected subsamples 
of stars (;=» 594 stars each) were made by cutting the color-sorted star list into equal-sized 
pieces (as closely as possible), and, for each subsample, the 10 parameters Vdisk; Xji S k, and 
«haio were found by the optimization described above. Figure 1 shows the 10 parameters 
for each of the 20 subsamples. The vertical error-bars on the points indicate uncertainties 
computed with 20 independent bootstrap resamplings of the data in each of the subsamples. 
Redder (and therefore longer-lived) stellar populations have larger velocity dispersions. 

Table 1 gives the 10 parameters, uncertainties, and uncertainty correlation matrix, for 
one of the subsamples. Figure 2 shows the mean Vdisk of the disk velocity distribution 
as a function of the trace tr(Vdi sk ) of its variance tensor Vdi sk - The mean velocity in the 
y direction is a strong function of velocity dispersion; this linear dependence justifies the 
standard methodology for determination of the local standard of rest (LSR). 

Operationally, the LSR is defined to be the mean velocity for a hypothetical population 
of stars with zero velocity distribution, i.e., the extrapolation to tr(V disk ) = of the trend 
shown in Figure 2. The points in Figure 2 have significant uncertainties in both dimensions, 
so fitting a line responsibly is not trivial. For this purpose we use again the projected 
mixtures of Gaussians procedure described above, but now there are 19 2-dimensional data 
points Wj, the Wj and are the same (i.e., the R; are the identity matrices), and we only 
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fit a single Gaussian ellipsoid. The straight line shown in the y T v disk (v y ) panel of Figure 2 
is the principal eigenvector of the best-fit Gaussian. The x T v disk and z T v disk (v x and v z ) 
panels show simply weighted averages. The errors in the fit are computed by bootstrap 
resampling the 19 samples themselves. 

The fits shown in Figure 2 provide an intercept corresponding to the estimated velocity 
relative to the Sun of a hypothetical population with vanishing velocity dispersion. The 
velocity of the Sun relative to this LSR is therefore 

v = -v LSR (10) 

v = [10.1 ±0.5 kms _1 ]x+ [4.0 ±0.8 kms' 1 ]y+ [6.7 ±0.2 kms _1 ]z . (11) 

Recall that rejection of halo stars was accomplished by fitting the velocity field with two 
Gaussians, one of which was fixed at the halo parameters given in equation (1). Re-fitting 
with the halo velocity dispersion increased to 150 kms^ 1 changes the inferred LSR by much 
less than the magnitude of its uncertainty. 

The vertex deviation is defined to be the angle between the x axis and the projection 
onto the x-y plane of the eigenvector corresponding to the largest eigenvalue of the velocity 
variance tensor V disk . The vertex deviation is shown as a function of stellar color for the 20 
subsamples in Figure 3. 

4. Algorithm tests 

To test the algorithm, 20 subsamples of artificial 3-dimensional stellar velocities v» were 
generated with a mixtures of Gaussians random sample generator. The distribution was 
made with two Gaussians, one for the halo, with parameters as assumed above, and one 
for the disk, with mean v d i sk and variance tensor V d i sk different for each subsample. The 
artificial Vj were projected into artificial measurements Wj using the same Rj as in the real 
20 subsamples, and errors were added, drawn from two-dimensional Gaussian ellipsoids with 
the same variances as the measurement uncertainty covariance tensors Sj. I.e., the artificial 
data were given all of the observational properties of the real data (modulo the assumptions 
of this study). 

For subsamples with mean color (B — V) < 0.1 mag, the artificial variance tensor V disk 
was set to the measured value (shown in Figure 1) for the subsample with (B—V) 0.05 mag, 
and for subsamples with mean (B — V) > 0.6 mag, the artificial tensor V disk was set to the 
measured value for the subsample with (B — V) ~ 0.67 mag. In between, i.e., for artificial 
subsamples with mean color 0.1 < (B — V) < 0.6 mag, the variance tensor was made to vary 
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quadratically with color, so as to approximate the appearance of the true observations (and 
span the range of observed variance tensors. The artificial mean v disk was set to a linear 
function of the trace tr(V disk ) of the variance. 

Exactly the same fitting code and bootstrap analysis was applied to the artificial data as 
was applied to the real data. The results are shown in Figures 4 and 5, along with the input 
values used to make the artificial data. The best-fit parameters are, except for a couple of 
samples, within one standard deviation of the input paramters. More importantly for present 
purposes, the LSR is very well determined; the algorithm returns the correct LSR velocity 
to well within one standard deviation. We conclude that the algorithm is not significantly 
biased. 



5. Generalized mult i- Gaussian disk 

Perhaps the biggest limitation of LSR measurements like this one is that the disk velocity 
distribution function is far from Gaussian; it is not even unimodal (Dehnen 1998; Skuljan 
et al. 1999; Chereul et al. 1998); one of the primary goals of our future work is to explore the 
complexities of disk star velocities. As a baby step towards checking the influence of disk 
velocity non-Gaussianity on the LSR determination, the model was generalized to allow for 
not just one Gaussian ellipsoid to fit the each color subsample's disk velocity distribution 
function but -Kdisk > 1 Gaussians, all constrained to have the same mean. Models with 
-ft'disk > 1 have the freedom to have larger "tails" to the velocity distribution, and for those 
tails to be rotated or twisted in velocity space relative to the core of the velocity distribution. 

The generalized model is optimized by an algorithm constructed exactly parallel to that 
of the -Kdisk = 1 model, but now there are 4 + 6 i^disk free parameters for each subsample. 

Increasing -Kdisk increases the goodness-of-fit (of course), but at very large i^disk, the 
data will be "overfit" . To determine the optimal value of -Kdisk for each of the 20 stellar sub- 
samples (the optimal -Kdisk will, in general, be different for different subsamples, of course), 
a "jackknife likelihood" was computed: For each of 5 iterations, a randomly selected 10 per- 
cent of each subsample was removed and put aside as a "test set." Fitting (i.e., parameter 
determination by maximum likelihood) was performed on the remaining 90 percent, and the 
likelihood of the test set was tested within the context of the best-fit model. The logarithms 
of the jackknife likelihoods for the five iterations were averaged and the i^disk with the best 
jackknife likelihood was chosen for each subsample. 

Figure 6 shows the LSR determination when the generalized model is used and each 
sample is fit with the optimal jackknife likelihood value of K disk . The velocity of the Sun 
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relative to the LSR we find when using the optimal K&sk values is 

v = [10.2 ±0.5 kms" 1 ]x+ [4.0 ± 0.8 kms" 1 ]y+ [6.7 ± 0.2 kms _1 ]z . (12) 

This is extremely similar to (much closer than 1 standard deviation away from) that found 
using -ft'disk = 1 (Figure 2). This suggests that that the assumption of Gaussianity is not 
strongly affecting the results. 



6. Discussion 

In the above, we developed and used a novel algorithm to infer the three-dimensional 
velocity distribution from a kinematically unbiased sample of Hipparcos stars. 

The local velocity dispersion is a strong function of stellar color, and the mean velocity 
of a color-selected stellar population is a linear function of its velocity variance; this confirms 
previous results (e.g., Dehnen & Binney 1998). The extrapolation of this relation to zero 
velocity dispersion provides an estimate of the local standard of rest (LSR), which is found 
to be 

v = [10.1 ±0.5 kms- 1 ]x+ [4.0 ±0.8 km S - 1 ]y+ [6.7 ±0.2 kms^z , (13) 

where x, y, and z are unit vectors pointing in the directions of the standard Galactic velocity 
components U, V, and W. This result is similar to previouw LSR determinations; we 
compare this result to one previous study below. Our answer did not change much when we 
relaxed the assumption that the disk star velocity distributions can be modeled as Gaussians. 

We also showed that it is possible to robustly and reliably solve a missing data problem 
in astrophysics: the reconstruction of aspects of the three-dimensional velocity distribution 
function from individual velocity measurements every one of which is missing data in the 
radial direction. 

Dehnen & Binney (1998), using the same subsample of the same data set, find a some- 
what different Solar velocity relative to the LSR; they find 

v = [10.00 ±0.36 kms" 1 ]x± [5.25 ±0.62 kms" 1 ] y ± [7.17 ± 0.38 kms _1 ]z , (14) 

They also find a lower mean velocity variance for the long-lived disk stars. These two 
differences are probably related, since the LSR is determined by fitting the relationship 
between velocity and velocity variance. Although the studies agree to within about one 
standard deviation, better agreement might be expected since both studies are using identical 
data subsets of the same data set. Both studies have made the incorrect assumption that 
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the stellar velocity distribution is Gaussian; Dehnen & Binney (1998) did so in subtracting 
a measurement uncertainty variance from the measured velocity variance. The method 
presented here has been shown (using the generalized multi-Gaussian disk model) to be 
insensitive to the Gaussianity of the velocity distribution. The incorrect assumption of 
Gaussianity, entering differently in the Dehnen & Binney (1998) investigation, may account 
for the difference between the results. Because our method involves the optimization of a 
well-defined objective, because we have tested our method successfully with artificial data, 
and because we have been able to relax the Gaussian assumption, we prefer our result. 
Certainly the differences show that stellar velocity studies have become precise enough that 
algorithms matter. However, it must be emphasized that the true velocity distribution is far 
from a unimodal Gaussian (Dehnen 1998; Skuljan et al. 1999; Chereul et al. 1998), so it is 
not clear that it is possible to make a "correct" LSR determination at all. 
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A. Fitting Mixtures with Incomplete Data using the EM algorithm 

The EM algorithm (Dempster et al. 1977) can be used to optimize the likelihood function 
of a probabilistic model involving incomplete observation data (hidden variables). Starting 
from user-supplied starting parameters, its iterations generate a sequence of parameters 
which monotonically increase the likelihood of a fixed data set under the model; thus it finds 
locally maximum likelihood parameters. 

EM proceeds by optimizing, at each point in parameter space, a new function which 
is a strict lower bound on the data likelihood. This new functions depends on the original 
model parameters as well as on some extra auxiliary quantities introduced by EM. The EM 
algorithm iteratively increases the lower bound by coordinate ascent: first (the "M step") 
the original model parameters are optimized (holding the auxiliary quantities fixed) and then 
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(the "E step") the auxiliary quantities are optimized (with the parameters fixed). After the 
optimization of the auxiliary quantities, the new function actually becomes equal to the true 
model likelihood (the bound becomes tight); thus at each iteration this true likelihood is 
nondecreasing. The M and E steps are iterated to convergence, which, here as usually, is 
identified by extremely small incremental improvement in the logarithm of the likelihood per 
iteration. 

In particular, for any auxiliary distribution q{v,j\w) we can lower bound the model 
likelihood lnp(w) by a functional F(q). (In what follows, we slightly abuse notation by 
using j both as an index and as a random variable representing the identity of the mixture 
component responsible for generating a particular data point.) 



lnp(w|0) = ln^ / p(w, v,j\9) dv 

> J2 J v qln P ^ q m dv = F( W \q,6) 

> F(w\q,9) = / <?(v,j|w) [lnp(w, v,j\6) - lng(v,j|w)] dv 
lnp(w\0) > F(w\q,0) = {hip(w,v,j\0)) q + H(q) , (Al) 

where 9 represents the set of model parameters, and Ti is the entropy of the distribution q. 
Our strategy is now coordinate maximization of F. 

In the E step we maximize F with respect to the auxiliary distribution q. It is easy 
to show (for example by checking that it saturates the bound on F) that the maximizing 
distribution q is the conditional distribution of v and j given the observations w and the 
current parameters: 

E step: q(v, j) <— argmax g F(w|g, 9) = p(v, j\w, 9) . (A2) 

In the M step we maximize F with respect to the parameters 9. This maximization 
reduces to maximization of the expected complete log likelihood under the current variational 
distribution (since the entropy of q does not depend on 9): 

M step: 9 <— argmax g F(w|g, 9) = argmax g ^ / q(v, j) lnp(w, v, j\9) dv . (A3) 
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A.l. EM algorithm for projected mixtures 

For the projected mixtures model the parameters consist of the mixture component 
amplitudes aij, means mj, and variances V,-. The auxiliaries are posterior distributions for 
each star i: q(j\wi) over mixture components and q(v\j, Wj) over the true velocity (given the 
observed projected velocity, assuming it came from a specific component). In the terninology 
of this Appendix, the parameters of the probability distribution for the true 3-space velocity 
v of each star, including especially the probabilities that the star was drawn from each 
of the gaussian velocity components j, are the "auxiliary quantities," and the parameters 
(amplitudes, means, and velocity variance tensors) of the velocity components j are the 
"model parameters." 

First we consider the E step, in which the auxiliary distributions q(v : j\w i: 9) are op- 
timized. In the case of projected mixtures, the posterior over v,j given w and the model 
parameters 9 is itself a conditional mixture of Gaussians: 

E step: p(v,j|w) = pO'K) p(v|j, w) 



pO'I w <) 



P(v|wi,j) = A^(v|bij,By) 

bij = m ; + Vj RJ T^. 1 (w - Ri m,-) 

B l3 = V, V ; R, T, ; ' R, V ; ; (A4) 

Thus, the optimal choice for q(v : j\w i: 9) is: 

9(v,j|wi) = q(j\vfi)q(v\j,Wi) 

901 w i) = Qij=P(J\ w i) 

q{v\wi,j) = A^(v|b^,By) (A5) 

For the M step updates, we must explicitly write out the form of the functional F and 
take its partial derivatives with respect to each set of model parameters: 

^ = E <lnp(w,|v,) + lnp(vi|j) + hi p{j)) q . + 

i 

= ~\ E E [<( w * - ^v) T Sr 1 (w, - R,v)(v - mjyV-'iv - m,-))^,^ 
+ In det Sj + In det Vj + log ctjj 
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= "a E E ^ [w^SrV* + mTV^m,- - 2wTSr 1 R i b y - 2mJVr 1 b y 
+ Trace [(R^S^R, + V ; 'lib,,!,,, + By)] 

+ In det Si + In det V,- + log a*] (A6) 

Deriving the E step and M step for the particular projected mixtures model given above 
leads to the following update equations: 



M step: aij <- ^ gy 

i 



Vj <- — ftj [(m, - by) (mj - by) T + By] 
Ty <- RiVj-R^ + Si . (A7) 

Some care must be taken to implement these equations in a numerically stable way. In 
particular, care should be taken to avoid underflow when computing the ratio of a small 
probability over the sum of other small probabilities. Notice that we don't have to explicitly 
enforce constraints on parameters, e.g., keeping covariances symmetric and positive definite, 
since this is taken care of by the updates. For example, the update equation for V,- is 
guaranteed by its form to produce a symmetric nonnegative definite matrix. 
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Fig. 1. — The best-fit parameters of the model as a function of stellar color for the 20 color- 
selected subsamples. The off-diagonal elements of the velocity variance tensor Vdisk have 
been scaled by square-roots of products of the diagonal elements. 



Table 1. example parameter set for the subsample with 0.654 < (B — V) < 0.685 



parameter value units correlation matrix of the (squared) uncertainties 



x T v disk 


-9.3 ± 1.9 


kms" 


l 


1.00 


0.20 


0.05 


0.10 


-0.14 


0.38 


-0.49 


-0.30 


0.16 


-0.53 


y T Vdisk 


-23.2 ± 1.3 


kms" 


l 


0.20 


1.00 


0.39 


-0.23 


-0.37 


-0.08 


0.12 


0.18 


-0.07 


0.24 


z T v disk 


-8.9 ± 1.1 


kms" 


l 


0.05 


0.39 


1.00 


-0.42 


-0.17 


0.03 


0.09 


0.28 


-0.30 


-0.02 


x T V disk x 


1329. ± 95. 


km 2 s" 


-2 


0.10 


-0.23 


-0.42 


1.00 


0.37 


-0.04 


-0.10 


-0.19 


0.36 


-0.24 


y T Vdisk y 


474. ± 58. 


km 2 s" 


-2 


-0.14 


-0.37 


-0.17 


0.37 


1.00 


-0.30 


0.21 


0.07 


0.53 


-0.02 


z T v disk z 


418. ±47. 


km 2 s" 


-2 


0.38 


-0.08 


0.03 


-0.04 


-0.30 


1.00 


0.04 


0.02 


-0.05 


-0.47 


x T V disk y 


95. ±46. 


km 2 s" 


-2 


-0.49 


0.12 


0.09 


-0.10 


0.21 


0.04 


1.00 


0.27 


0.00 


0.15 


x T V disk z 


11. ±56. 


km 2 s" 


-2 


-0.30 


0.18 


0.28 


-0.19 


0.07 


0.02 


0.27 


1.00 


-0.54 


0.19 


y T v disk z 


32. ± 44. 


km 2 s" 


-2 


0.16 


-0.07 


-0.30 


0.36 


0.53 


-0.05 


0.00 


-0.54 


1.00 


0.05 


«halo 


0.0081 ±0.0052 






-0.53 


0.24 


-0.02 


-0.24 


-0.02 


-0.47 


0.15 


0.19 


0.05 


1.00 ^ 



I 
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Fig. 2. — The mean velocity v^sk a s a function of total velocity variance tr(Vdi S k) for the 
determination of the local standard of rest (LSR). In each panel, the ellipses indicate the one- 
sigma uncertainty regions (from bootstrap resampling — see text) of the measurements. The 
linear fit of V vs. S 2 was performed with the projected Gaussian mixtures algorithm because 
it accounts correctly for the finite errors in both dimensions (see text). The point shown in 
grey was excluded from the fit because the stars in that subsample are very short-lived (see 
text). The reported uncertatinties on the intercepts are from 20 bootstrap resamplings of 
the ellipsoidal points shown. 
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Fig. 3. — The vertex deviation (see text for definition) as a function of color. 
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Fig. 4. — Same as Figure 1, but for the artificial data (see text). The dotted lines indicate 
the input values used to make the artificial data. 
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Fig. 5. — Same as Figure 2, but for the artificial data (see text). In each panel there is a 
dotted line indicating the input values used to make the artificial data. 
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Fig. 6. — Same as Figure 2, but for the generalized model in which the disk velocity distri- 
bution is fit with a mixture of -ft'disk Gaussian ellipsoids with common mean. Each datapoint 
shows the result for that subsample for the optimal jackknife value of K disk (see text), and 
is labeled by that value of -K" disk . 



